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Abstract 

Nonnegative matrix factorization (NMF) is a data analysis technique used in a great variety 
of applications such as text mining, image processing, hyperspectral data analysis, computational 
biology, and clustering. In this paper, we consider two well-known algorithms designed to solve 
NMF problems, namely the multiplicative updates of Lee and Seung and the hierarchical alternating 
least squares of Cichocki et al. We propose a simple way to significantly accelerate these schemes, 
based on a careful analysis of the computational cost needed at each iteration, while preserving 
their convergence properties. This acceleration technique can also be applied to other algorithms, 
which we illustrate on the projected gradient method of Lin. The efficiency of the accelerated 
algorithms is empirically demonstrated on image and text datasets, and compares favorably with 
a state-of-the-art alternating nonnegative least squares algorithm. 

Keywords: nonnegative matrix factorization, algorithms, multiplicative updates, hierarchical al- 
ternating least squares. 
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1 Introduction 

Nonnegative matrix factorization (NMF) consists in approximating a nonnegative matrix M as a low- 
rank product of two nonnegative matrices W and H, i.e., given a matrix M S M^^" and an integer 
r < min{m, n}, find two matrices W G M.'^'^'' and H G M^j.^" such that WH f» M. 

With a nonnegative input data matrix M, nonnegativity constraints on the factors W and H 
are well-known to le ad to low-rank deco mpositions with bet ter interpretation i n many applications 
such as t ext mining ( Shahnaz et al.l . l2006l ) , image proc essing ( Lee &: Seuna . Il999l ) , hypers pectral data 
analy sis (jPauca et al.l . l2006l ). computational biology ( DevaraiarJ . l2008l ). and clustering ( Ding et al 



20051 ). U nfortunately, i mposing these constraints is also known to render the problem computationally 
difHcult (|Vavasigl . [2009l 'l. 

Since an exact low-rank representation of the input matrix does not exist in general, the quality 
of the approximation is measured by some criterion, typically the sum of the squares of the errors on 
the entries, which leads to the following minimization problem: 
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where ||^||f = (Sji^fi)^ denotes the Frobenius norm of matrix A. Most NMF algorithms are 



iterative, and exploit the fact that (jNMFp reduces to an efficiently solvable convex nonnegative least 
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squares problem (NNLS) when one of the factors W or H is fixed. Actually, it seems that nearly all 
algorithms proposed for NMF adhere to the following general framework 

(0) Select initial matrices {W^^' , H^^') (e.g., randomly). Then for A; = 0, 1, 2, . . . , do 

(a) Fix if(*=) and find T^^'^+i) > such that \\M - Ty('=+i)ij('=)|||, < \\M - W^'''>H^'''>\\j,. 

(b) Fix Ty(*=+i) and find ij(^+i) > such that \\M - ]/r('^'+i)ij(^+i)||2, < ||M - W^''+'^^H('''>\\1. 

More precisely, at each iteration, one of the two factors is fixed and the other is updated in such 
a way that the objective function is reduced, which amounts to a two-block coordinate descent 
method. Notice that the role of matrices W and H is perfectly symmetric: if one transposes in- 
put matrix M, the new matrix M has to be approximated by a product H W , so that any formula 
designed to update the first factor in this product directly translates into an update for the sec- 
ond factor in the original problem. Formally, if the update performed in step (a) is described by 
^(«+i) = update(M, M^' •*, ii^'^), an algorithm preserving symmetry will update the factor in step 
(b) according to H^^^^> = n'p(iSiie{M'^ ,H^^>'^ ,W^^~^^>'^)'^ . In this paper, we only consider such sym- 
metrical algorithms, and focus on the update of matrix W . 

This update can be carried out in many different ways: the most natural possibility is to compute 
an optimal solution for the NNLS subproblem, which leads to a class of algorithms called alternat- 
ing nonnegative least squares (ANLS), see, e.g.. iH.Kim k. ParkI (120081). However, this computat ion, 



which can be performed with active-set-like methods ( H.Kim &: ParkI . l2008l : Ij.Kim &: ParkI . l2008l ) , is 
relatively costly. Therefore, since an optimal solution for the NNLS problem corresponding to one 
factor is not required before the update of the other factor is performed, several algorithms only com- 
pute an approximate solution of the NNLS subproblem, sometimes very roughly, but with a cheaper 
computational cost, leading to an inexact two-block coordinate descent scheme. We now present two 
such procedures: the multiplicative updates of Lee and Seung and the hierarchical alternating least 
squares of Cichocki et al. 

In their seminal papers. iLee &: Seund (jl999l . l200lh introduce the multiplicative updates: 



^(fc+i) =yuj{M,w'^^\H 



(kh 



W^^K 



[M//(^)^ 



[W(k)fj{k)ff{k)^ 



where o (resp. |4) denotes the component- wise product (resp. division) of matrices, and prove that each 
update monotonically decreases the Frobenius norm of the error ||M— VFff NF, i.e., satisfies the descrip- 
tion o f steps (a) and (b). This technique was actually originally proposed bv lDaube-Witherspoon &: Mue 
( 19861 ) to solve NNLS problems. The popularity of this algorithm came along with the popularity of 
NMF and rnany a ut hors have studied or used this algorithm or variants to compute NMF's, see, e.g., 
Berrv et al.l ( 20071 ): ICichocki et al.l ( 20091 ) and the references therein. In particular, the MATLAB® 
Statistics Toolbox implements this method. 

However, M U have been obse r ved to converge re l ativel y slowly, especially when dealing with dense 
matrices M, see lHan et al.l ( 20091 ) : iGillis fc Glineuri ( 20081 ) and the references therein, and many other 
algorithms hav e been s ubsequently intro d uced which perform better in most situations. Fo r example , 
Cichocki et al.l (120071'): ICichocki &: PhanI (|2009| ) and, independently, several other authors (jHol . l2008l : 



Gillis &: GlineurL I2OO8I : iLi fc Zhand . l2009l ) proposed a technique called hierarchical alternating least 



squares (HALSJ^, which successively updates each column of W with an optimal and easy to compute 
closed-form solution. In fact, when fixing all variables but a single column W-p of W, the problem 
reduces to 



minJiM - WH\\l = \\{M -Y^WaHi) - W..pHp,\\l = Y,\\{Mi., -Y^WuHt) - WipHp.,\\j,. 



Hoi (|2008l ) refers to HALS as rank-one residue iteration (RRI), and JLi fc Zhand (|2009l ) as FastNMF. 



Because each row of W only affects the corresponding row of the product WH, this problem can be 
further decoupled into m independent quadratic programs in one variable Wip, corresponding to the 
i row of M. The optimal solution W* of these subproblems can be easily written in closed-form 



T 



Wi„ = max 
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tT. 



= max 0, \^rr, , 1 < z < m. 

Hence HALS updates successively the columns of W , so that VFC'+i) = HALS(M, W'^^\li'^^'^) can be 
computed in the following way: 



VF,Jf+^) = max {^, 



^■■v - ml Wp'^B,, - EUp^, wf^B,p 



J^pp 



successively for p = 1,2, ... ,r, where A = MH^ ' and B = H^ 'H^ ' . This amounts to approxi- 
mately solving each NNLS subproblem in W with a single complete round of an exact block-coordinate 
descent method with r blocks of m variables corresponding to the columns of W (notice that any other 
ordering for the update of the columns of W is also possible). 

Other approa ches based on iterative methods to s olve the NNLS subproblems include proje cted 
gra dient descent (Lin . 2007al ) or Newton-like methods (JDhillon et al.l . 120071 : ICichocki et al.l . l2006l ) : see 



also ICichocki et al.l (J2009l ) and the references therein. 



We first analyze in Section [2] the computational cost needed to update the factors W in MU and 
HALS, then make several simple observations leading in Section [3] to the design of accelerated versions 
of these algorithms. These improvements can in principle be applied to any two-block coordi nate 



descent NMF algorithm, as demonstrated in Section 13.41 on the projected gradient method of [Lin 



(|2007al ). We mainly focus on MU, because it is by far the most popular NMF algorithm, and on 
HALS, because it is very efficient in practice. Section 2] studies convergence of the accelerated variants 
to stationary points, and shows that they preserve the properties of the original schemes. In Section [5l 
we experimentally demonstrate a significant acceleration in convergence on several irnage an d text 
datasets, with a comparison with the state-of-the-art ANLS algorithm of lJ.Kim Sz ParkI ( 20081 ). 



2 Analysis of the Computational Cost of Factor Updates 

In order to make our analysis valid for both dense and sparse input matrices, let us introduce a 
parameter K denoting the number of nonzero entries in matrix M [K = mn when M is dense). 
Factors W and H are typically stored as dense matrices throughout the execution of the algorithms. 
We assume that NMF achieves compression, which is often a requirement in practice. This means 
that storing W and H must be cheaper than storing M: roughly speaking, the number of entries in 
W and H must be smaller than the number of nonzero entries in M, i.e., r{m + n) < K. 

Descriptions of Algorithms [1] and [2] below provide separate estimates for the number of floating 
point operations (flops) in each matrix product computation needed to update factor W in MU and 
HALS. One can check that the proposed organization of the different matrix computations (and, in 
particular, the ordering of the matrix products) minimizes the total computational cost (for example, 
starting the computation of the MU denominator WHH^ with the product WH is clearly worse than 
with HH'^). 

MU and HALS possess almost exactly the same computational cost (the difference being a typically 
negligible mr flops) . It is particularly interesting to observe that 



Algorithm 1 MU update for W^^^) 



1: A = MH^'"'^^; -^ 2Kr flops 

B = i/(fc)i7(fc)^; -^ 2nr2 flops 



C = H^(^)5; -^ 2mr2 flops 

^(fc+i) = ^(fc) o |A|. ^ 2mr flops 

% Total: r(2K + 2nr + 2mr + 2m) flops 



Algorithm 2 HALS update for W"(^) 



1: ^ = MH^^^'^; -^ 2Kr flops 

S = H^k)jj{kf.^ _^ 2nr^ flops 

for i = 1, 2, . . . ,r do 

C;fe = ECl' W^:f ^'^^ife + E[=p+i W^I'^Bik-, ^ 2m(r - 1) flops 

5: W:k = max (o, '^%'^''' ) ; ^ 3m flops 

6: end for 



% Total: r(2K + 2nr + 2mr + m) flops 



1. Steps 1. and 2. in both algorithms are identical and do not depend on the matrix W^^'; 

2. Recalling our assumption K > r{7n + n), computation oi MH^ > (step 1.) is the most expensive 
among all steps. 

Therefore, this time-consuming step should be performed sparingly, and we should take full advantage 
of having computed the relatively expensive MH^ ' and H^ 'H^ ' matrix products. This can be 
done by updating W^^' several times before the next update of H^^\ i.e., by repeating steps 3. and 
4. in MU (resp. steps 3. to 6. in HALS) several times after the computation of matrices MH^^> and 
H^ 'H^ ' . In this fashion, better solutions of the corresponding NNLS subproblems will be obtained 
at a relatively cheap additional cost. 

The original MU and HALS algorithms do not take advantage of this fact, and alternatively update 
matrices W and H only once per (outer) iteration. An important question for us is now: how many 
times should we update W per outer iteration?, i.e., how many inner iterations of MU and HALS 
should we perform? This is the topic of the next section. 

3 Stopping Criterion for the Inner Iterations 

In this section, we discuss two different strategies for choosing the number of inner iterations: the 
first uses a fixed number of inner iterations determined by the flop counts, while the second is based 
on a dynamic stopping criterion that checks the difference between two consecutive iterates. The first 
approach is shown empirically to work better. We also describe a third hybrid strategy that provides 
a further small improvement in performance. 

3.1 Fixed Number of Inner Iterations 

Let us focus on the MU algorithm (a completely similar analysis holds for HALS, as both methods 
differ only by a negligible number of flops). Based on the flops counts, we estimate how expensive the 
first inner update of W would be relatively to the next ones (all performed while keeping H fixed), 



which is given by the following factor pw (the corresponding value for H will be denoted by pn] 



2Kr + 2nr^ + 2mr^ + 2mr K + nr ( K + 



PW = 7, ^T-T, = 1 + \ • [PH = 1 + 



■ mr 



2mr'^ _j_ 2mr mr + m \ nr + n 

Values of pw and pH for several datasets are given in Section [5l see Tables [1] and [2j 

Notice that for K > r{m + n), we have pw ^ "^v+l ^^ that the first inner update of W is at 
least about twice as expensive as the subsequent ones. For a dense matrix, K is equal to mn and we 
actually have that pw = 1 + ^7^, H > 1 + ^tti) which is typically quite large since n is often much 
greater than r. This means for example that, in our accelerated scheme, W could be updated about 
1 + p-\^ times for the same computational cost as two independent updates of W in the original MU. 
A simple and natural choice consists in performing inner updates of W and H a fixed number of 
times, depending on the values of pw and pn- Let us introduce a parameter a > such that W is 
updated (1 + apw) times before the next update of H, and H is updated (1 + apn) times before 
the next update of W . Let us also denote the corresponding algorithm MUq (MUq reduces to the 
original MU). Therefore, performing (1 + apw) inner updates of W in MUq has approximately the 
same computational cost as performing (1 + a) updates of W in MUq- 

In order to find an appropriate value for parameter a, we have performed some preliminary tests 
on image and text datasets. First, let us denote e(t) the Frobenius norm of the error \\M — WH\\f 
achieved by an algorithm within time t, and define 

where e(0) is the error of the initial iterate (W^^\H^^'), and emin is the smallest error observed 
among all algorithms across all initializations. Quantity E{t) is therefore a normalized measure of 
the improvement of the objective function (relative to the initial gap) with respect to time; we have 
< E{t) < 1 for monotonically decreasing algorithms (such as MU and HALS). The advantage of 
E(t) over e(t) is that one can meaningfully take the average over several runs involving different 
initializations and datasets, and display the average behavior of a given algorithm. 

Figure [T] displays the average of this function E{t) for dense (on the left) and sparse (on the right) 
matrices using the datasets described in Section [5] for five values of a = 0,0.5,1,2,4. We observe 
that the original MU algorithm (a = 0) converges significantly less rapidly than all the other tested 
variants (especially in the dense case). The best value for parameter a seems to be 1. 

Figure [2] displays the same computational experiments for HALyj. HALS with a = 0.5 performs 
the best. For sparse matrices, the improvement is harder to discern (but still present); an explanation 
for that fact will be given at the end of Section | 



3.2 Dynamic Stopping Criterion for Inner Iterations 

In the previous section, a fixed number of inner iterations is performed. One could instead consider 
switching dynamically from one factor to the other based on an approp riate criteri on. For example, 
it is possible to use the norm of the projected gradient as proposed bv iLinI ( 2007al ). A simpler and 



cheaper possibility is to rely solely on the norm of the difference between two iterates. Noting W^'^''"' 
the iterate after I updates of W^''' (while H^^' is being kept fixed), we stop inner iterations as soon as 

||p^(fc,m) _ ^W)||^ < e||-p^(fc.i) _ w'^'^'^^Wf, (3.2) 



^Because HALS involves a loop over the columns of W and rows of H, we observed that an update of HALS is 
noticeably slower than an update of MU when using MATLAB® (especially for r ^ 1), despite the quasi-equivalent 
theoretical computational cost. Therefore, to obtain fair results, we adjusted pw and pH by measuring directly the ratio 
between time spent for the first update and the next one, using the cputime function of MATLAB® . 
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Figure 1: Average of functions E{t) for MU using different values of a: (left) dense matrices, (right) 
sparse matrices, computed over 4 image datasets and 6 text datasets, using two different values for 
the rank for each dataset and 10 random initializations, see Section O 
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Figure 2: Average of functions E{t) for HALS using different values of a: (left) dense matrices, (right) 
sparse matrices. Same settings as in Figure [Ij 



i.e., as soon as the improvement of the last update becomes negligible compared to the one obtained 
with the first update, but without any a priori fixed maximal number of inner iterations. 

Figures [3] shows the results for MU with different values of e (we also include the original MU and 
MU with a = 1 presented in the previous section to serve as a comparison) . Figures [3] displays the 
same experiment for HALS. 

In both cases, we observe that the dynamic stopping criterion is not able to outperform the 
approach based on a fixed number of inner iterations (a = 1 for MU, a = 0.5 for HALS). Moreover, 
in the experiments for HALS with sparse matrices, it is not even able to compete with the original 
algorithm. 
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Figure 3: Average of functions E{t) for MU using different values of e, with a = and a = 1 for 
reference (see Section [3. ip : (left) dense matrices, (right) sparse matrices. Same settings as in Figured! 
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Average of functions E{t) for HALS using different values of e, with a = and a = 0.5 for 
(see Section [^TT]) : (left) dense matrices, (right) sparse matrices. Same settings as in Figure [H 



3.3 A Hybrid Stopping Criterion 

We have shown in the previous section that using a fixed number of inner iterations works better than a 
stopping criterion based solely on the difference between two iterates. However, in some circumstances, 
we have observed that inner iterations become ineffective before their maximal count is reached, so 
that it would in some cases be worth switching earlier to the other factor. 

This occurs in particular when the numbers of rows m and columns n of matrix M have different 
orders of magnitude. For example, assume without loss of generality that m <C n, so that we have 
pw ^ Ph- Hence, on the one hand, matrix W has significantly less entries than H (mr <C nr), 
and the corresponding NNLS subproblem features a much smaller number of variables; on the other 
hand, pw ^ Ph so that the above choice will lead many more updates of W performed. In other 



words, many more iterations are performed on the simpler problem, which might be unreasonable. 
For example, for the CBCL face database (cf. Section [S]) with m = 361, n = 2429 and r = 20, we 
have ph ~ 18 and pw ~ 123, and this large number of inner VF-updates is typically not necessary to 
obtain an iterate close to an optimal solution of the corresponding NNLS subproblem. 



Therefore, to avoid unnecessary inner iterations, we propose to combine the fixed number of inner 
iterations proposed in Section [3. II with the supplementary stopping criterion described in Section [3.21 
This safeguard procedure will stop the inner iterations before their maximum number [1 + apw\ is 
reached when they become ineffective (depending on parameter e, see Equation ()3.2p ). Algorithm [3] 
displays the pseudocode for the corresponding accelerated MU, as well as a similar adaptation for 
HALS. Figures E] and O displays the numerical experiments for MU and HALS respectively. 



Algorithm 3 Accelerated MU and HALS 



and initial iterates (Ty(°),ij(°)) G 



hrxn 



Require: Data matrix M € M' 
1: for k = 0,1,2,... do 

2: Compute A = MH('''>^ and B = H^k)jj{kf.^ ^(fc,o) ^ ^(fc). 

3: for 1 = 1: [1 + apw\ do 

4: Compute T^^*^'') using either MU or HALS (cf. Algorithms [1] and [2]) ; 

5: if ||P^(^-'') - ^('^■'-i)||i. < eWW^^'^^i - ^C^'O^IIf then 

6: break; 

7: end if 

8: end for 

9: "H/C^+i) = VF^^''); 

10: Compute //('^+^) from H^^> and W^^~^^' using a symmetrically adapted version of steps 2-9; 
11: end for 
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Figure 5: Average of functions E{t) for MU using different values of a and e: (left) dense matrices, 
(right) sparse matrices. Same settings as in Figure [H 



In the dense case, this safeguard procedure slightly improves performance. We also note that the 
best values of parameter a now seem to be higher than in the unsafeguarded case (a = 2 versus 
a = 1 for MU, and a = 1 versus a = 0.5 for HALS). Worse performance of those higher values of a 
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Figure 6: Average of functions E{t) for HALS using different values of a and e: (left) dense matrices, 
(right) sparse matrices. Same settings as in Figure [TJ 



in the unsafeguarded scheme can be explained by the fact that additional inner iterations, although 
sometimes useful, become too costly overall if they are not stopped when becoming ineffective. 

In the sparse case, the improvement is rather limited (if not absent) and most accelerated variants 
provide similar performances. In particular, as already observed in Sections 13. Il and l3. 21 the accelerated 
variant of HALS does not perform very differently from the original HALS on sparse matrices. We 
explain this by the fact that HALS applied on sparse matrices is extremely efficient and one inner 
update already decreases the objective function significantly. To illustrate this. Figure \7\ shows the 
evolution of the relative error 



E''{1) 



\M - W'^^''-^H^''^\ 



\M - PF(*:.o)iJ(^)| 



of the iterate VF('^'') for a sparse matrix M, wherq^l e^in = niini4/>o \\M — WH^^^\\f- Recall that 
(VF' '^\ H^ ') denotes the solution obtained after k outer iterations (starting from randomly generated 
matrices). For k = 1 (resp. k = 20), the relative error is reduced by a factor of more than 87% (resp. 
97%) after only one inner iteration. 

3.4 Application to Lin's Projected Gradient Algorithm 

The accelerating procedure described in the previous sections can potentially be applied to many other 
NMF algorithms. To illustrate this, we have modified Lin's projected gradient algorithm (PG) (JLinl . 
2007al ) by replacing the original dynamic stopping criterion (based on the stationarity conditions) by 
the hybrid strategy described in Section 13.31 It is in fact straightforward to see that our analysis is 
applicable in this case, since Lin's algorithm also requires the computation of HH^ and MH^ when 
updating W, because the gradient of the objective function in (jNMFp is given by VvkH-^ — VFif |||. = 
2WHH^ — 2Mi?-^. This is also a direct confirmation that our approach can be straightforwardly 
applied to many more NMF algorithms than those considered in this paper. 

Figure [5] displ ays t he cor responding computational results, comparing the original PG algorithm 
(as available from lLinI ( 2007al )) with its dynamic stopping criterion (based on the norm of the projected 
gradient) and our variants, based on a (safeguarded) fixed number of inner iterations. It demonstrates 



^We have used the active-set algorithm of lj.Kim fc ParkI l|2008f ) to compute the optimal value of the NNLS subproblem. 
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Figure 7: Evolution of the relative error E {I) of the iterates of inner iterations in MU and HALS, 
solving the NNLS subproblem miniy>o \\M — WH^'^'Wf with r = 40 for the classic text dataset (cf. 
Table ED. 



that our accelerated schemes perform significantly better, both in the sparse and dense cases (notice 
that in the sparse case, most accelerated variants perform similarly). The choice a = 0.5 gives the best 
results, and the safeguard procedure does not help much; the reason being that PG converges relatively 
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Figure 8: Average of functions E{t) for the projected gradient algorithm of iLirj ( 2007al ) . and its 
modification using a fixed number of inner iterations. Same settings as Figure [TJ 



slowly (we will see in Section [5] that its accelerated variant converges slower than the accelerated MU). 

4 Convergence to Stationary Points 

In this section, we briefly recall convergence properties of both MU and HALS, and show that they 
are inherited by their accelerated variants. 



10 



4.1 Multiplicative Updates 

It was shown bv IPaube-Witherspoon &: Muehllehneii ( 19861 ) and later bv lLee &: Seund (j 19991 ) that a 
single multiplicative update of W (i.e., replacing VF by VF o ,^^^j,. while H is kept fixed) guarantees 

that the objective function \\M — WH\\'jp does not increase. Since our accelerated variant simply 
performs several updates of W while H is unchanged (and vice versa), we immediately obtain that 
the objective function ||M — l^ff|||, is non-increasing under the iterations of Algorithm [3l 

Unfortunately, this property does not guarantee convergence to a st ationary po int of (|NMFp . and 
this question on the convergence of the MU seems to be still open, see iLinI ( 2007bl ) . Furthermore, in 
practice, rounding errors might set some entries m W or H to zero, and then multiplicative updates 
cannot modify their values. Hence, it was observed that desp i te the ir monotonicity, MU do not 
necessarily c onve r ge to a stationary point, see lGonzales &: Zhand (J2005l ). 

However, iLinI ( 2007bl ) proposed a slight modification of MU in order to obtain the convergence to 
a stationary point. Roughly speaking, MU is recast as a rescaled gradient descent method and the 
step l ength is modified accordingly. Another even simpler pos sibility is proposed by iGillis &: Glineur 
( 20081 ) who proved the following theorem (see also ( Gillid . l201ll . §4.1) where the influence of parameter 
S is discussed): 

Theorem 1 (JGilhs fc Ghneuil (jiooi)). For any constant 6 > 0, M > and an^ iW,H) > 6, 
\\M — WH\\f is nonincreasing under 



W 



max 



5 Wo [^^'h 



H 



max 



[W^M] ^ 



(4.1) 



where the max is taken component-wise. Moreover, every limit point of the corresponding (alternated) 
algorithm is a stationary point of the following optimization problem 



mm 
W>S,H>S 



\M -WH\ 



The proof of Theorem [T] only relies on the fact that the limit points of the updates ()4.ip are fixed 
points (there always exists at least one limit point because the objective function is bounded below 
and non-increasing under updates ()4.ip ). Therefore, one can easily check that the proof still holds 
when a bounded number of inner iterations is performed, i.e., the theorem applies to our accelerated 
variant (cf. Algorithm [3|) . 

It is important to realize that this is not merely a theoretical issue and that this observation can 
really play a crucial role in practice. To illustrate this. Figure O shows the evolution of the normalized 
objective function (cf. Equation ()3.ip ) using 6 = and 6 = 10~^® starting from the same initial 
matrices W^^' and H'^^' randomly generated (each entry uniformly drawn between and 1). We 
observe that, after some number of iterations, the original MU (i.e., with 5 = 0) get stuck while the 
variant with 6 = 10~^^ is still able to slightly improve W and H. Notice that this is especially critical 
on sparse matrices (see Figure [9l right) because many more entries of W and H are expected to be 
equal to zero at stationarity. For this reason, in this paper, all numerical experiments with MU use 
the updates from Equation ()4.ip with 6 = 10~^^ (instead of the original version with 6 = 0). 



4.2 Hierarchical Alternating Least Squares 

HALS is an exact block-coordinate descent method where blocks of variables (columns of W and 
rows of H) are optimized in a cyclic way (first the columns of W, then the rows of H, etc.). Clearly, 
exact block-coordinate descent methods always guarantee the objective function t o decrease. Howev er, 
convergence to a stationary point requires additional assumptions. For example, iBertsekasI (jl999al ]b[) 
(Proposition 2.7.1) showed that if the following three conditions hold: 



(W, H) > 5 means that W and H are component-wise larger than 5. 
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Figure 9: Functions E{t) for 5 = and 6 = 10 ^^ on the (dense) ORL face dataset (cf. Table [TJ and 
the (sparse) classic text dataset (cf. Table [2]) with r = 40. 



• each block of variables belongs to a closed convex set (which is the case here since the blocks of 
variables belong either to M™ or M" ), 

• the minimum computed at each iteration for a given block of variables is uniquely attained; 

• the function is monotonically nonincreasing in the interval from one iterate to the next; 

then exact block-coordinate descent methods converge to a stationary point. The second and the third 
requirements are satisfied as long as no columns of W and no rows of H become completely equal to 
zero (subproblems are then strictly convex quadratic programs, whose unique optimal solutions are 
given by the HALS updates, see Section [1]). In practice, if a column of M^ or a row of H b ecomes 
zer qi, we reinitialize i t to a small positive constant (we used 10~^^). We refer the reader to lHol (J2008. ) 
and lGillis Sz Glineud (2008]) for more details on the convergence issues related to HALS. 

Because our accelerated variant of HALS is just another type of exact block-coordinate descent 
method (the only difference being that the variables are optimized in a different order: first several 
times the columns of W, then several times the rows of H, etc.), it inherits all the above properties. 
In fact, the statement of the above-mentioned theorem in (JBertsekasl . Il999bl . p. 6) mentions that 'the 
order of the blocks may be arbitrary as long as there is an integer K such that each block-component is 
iterated at least once in every group of K contiguous iterations', which clearly holds for our accelerated 
algorithm with a fixed number of inner iterations and its hybrid varianl|2|. 



5 Numerical Experiments 

In this section, we compare the following algorithms, choosing for our accelerated MU, HALS and 
PG schemes the hybrid stopping criterion and the best compromise for values for parameters a and e 
according to tests performed in Section [3l 



1. (MU) The multiplicative updates algorithm of lLee &: Seungj (J200ll ). 



""In practice, this typically only happens if the initial factors are not properly chosen, see iHol (|2008 

''Note however that the accelerated algorithms based solely on the dynamic stopping criterion fSection l3.2p might not 

satisfy this requirement, because the number of inner iterations for each outer iteration can in principle grow indefinitely 

in the course of the algorithm. 
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2. (A-MU) The accelerated MU with a safeguarded fixed number of inner iterations using a = 2 
and e = 0.1 (cf. Algorithm [3D . 



3. (HALS) The hierarchical alternating least squares algorithm of lCichocki et alj (|2007l ) 



4. (A-HALS) The accelerated HALS with a safeguarded fixed number of inner iterations using 
a = 0.5 and e = 0.1 (cf. Algorithm [3D . 



5. (PG) The projected gradient method of JLinl (J2007al ) 



6. (A-PG) The modified projected gradient method of lLinI ( 2007al ) using a = 0.5 and e = (cf. 
Section [32 



7. (ANLS) The alternating nonnegative least squares algorithnjl of IJ.Kim &: ParkI ( 20081 ). which 
alternatively optimizes W and H exactly using a block-pivot active set method. Kim and Park 
showed that their method typically outperforms other tested algorithms (in particular MU and 
PG) on synthetic, images and text datasets. 

All tests were run using MATLAB® 7.1 (R14), on a 3GHz Intel® Core 2 dual core processor. 
We present numerical results on images datasets (dense matrices, Section 15. ip and on text datasets 
(sparse matrices. Section [5. 2p . Code for all algorithms but ANLS is available at 



http : //sites . google . com/site/nicolasgillis/ 



5.1 Dense Matrices - Images Datasets 

Table [1] summarizes the characteristics of the different datasets. 





Table 1: Image c 


atasets. 






Data 


# pixels 


m 


n 


r 


Ipw\ 


[ph\ 


ORL^ 


112 X 92 


10304 


400 


30, 60 


358, 195 


13, 7 


Umist^ 


112 X 92 


10304 


575 


30, 60 


351, 188 


19, 10 


CBCL^ 


19 X 19 


361 


2429 


30, 60 


12, 7 


85, 47 


Frey2 


28 X 20 


560 


1965 


30, 60 


19, 10 


67, 36 



[a;J denotes the largest integer smaller than x. 

^ http : //www . cl . cam . ac . uk/research/dtg/attarchive /f acedatabase . html | 

http: //www. cs .t pronto . edu/~roweis/data.html_ 
^|http://cbcl. mit.edu/cbcl/software-datasets/FaceDat a2.html I 



For each dataset, we use two different values for the rank (r = 30, 60) and initialize the algorithms 
with the same 50 random factors {W^^' , H^^>) (using i.i.d. uniform random variables on [0, l]]j. In 
order to assess the performance of the different algorithms, we display individually for each dataset 
the average over all runs of the function E(t) defined in Equation ()3.ip . see Figure [TOl 

First, these result s confirm what was already observed by pre vious works: PC per forms better 
than MU H, l2007al ). ANLS performs better than MU and PG d.T.Kim fc ParkI . l2008l l. and HALS 



'^Code is available at 'http : / /www . c c . gat e ch . edu/ ~ hpark/ ' 

^Generating initial matrices (W'''^'' ,H^''') randomly typically leads to a very large initial error e(0) 



\M 



W^'^'H^^'Wf- This implies that E{t) will get very small after one step of any algorithm. To avoid this large initial 
decrease, we have scaled the initial matrices s uch that argmin^ \ \M — aW^''^' H^''^'\\F~'i-; this simply amount to multiply- 



ing W and H by an appropriate constant, see lGillis fc Glineurl ([20081) . 
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Figure 10: Average of functions E{t) for different image datasets: ORL (top left), Umist (top right), 
CBCL (bottom left) and Frey (bottom right). 



performs the best ( Hoi . 120081 ) . Second, they confirm that the accelerated algorithms indeed are more 
efficient: A-MU (resp. A-PG) clearly outperforms MU (resp. PG) in all cases, while A-HALS is, by far, 
the most efficient algorithm for the tested databases. It is interesting to notice that A-MU performs 
better than A-PG, and only slightly worse than ANLS, often decreasing the error as fast during the 
first iterations. 



5.2 Sparse Matrices - Text Datasets 

Table [2] summarizes the characteristics of the different datasets. 

The factorization rank r was set to 10 and 20. For the comparison, we used the same settings as 
for the dense matrices. Figure \TT\ displays for each dataset the evolution of the average of functions 
E{t) over all runs. Again the accelerated algorithms are much more efficient. In particular, A-MU 
and A-PG converge initially much faster than ANLS, and also obtain better final solutionqj. A-MU, 
HALS and A-HALS have the fastest initial convergence rates, and HALS and A-HALS generate the 



We also observe that ANLS no longer outperforms the original MU and PG algorithms, and only sometimes generates 
better solutions. 
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Table 2: Text mining datasets 



Data 



classic 
sports 
reviews 
hitech 
ohscal 
lal 



m 



7094 

8580 

4069 

2301 

11162 

3204 



Zhong &: Ghoshl . l2005l ) (sparsity is given in %: 100 * #zeros/(mn)). 



n 



41681 
14870 
18483 
10080 
11465 
31472 



10, 20 
10, 20 
10, 20 
10, 20 
10, 20 
10, 20 



TT^nonzero 



223839 
1091723 
758635 
331373 
674365 
484024 



sparsity 



99.92 
99.14 
98.99 
98.57 
99.47 
99.52 



Lp 



w 



12, 9 
18, 11 
35, 22 
25, 16 

7,4 
31, 21 



[ph\ 



2, 1 
10, 6 

8,4 
5,4 
7,4 

3, 2 



best solutions in all cases. Notice that A-HALS does not always obtain better final solutions than 
HALS (still this happens on half of the datasets), because HALS already performs remarkably well 
(see discussion at the end of Section [3. 3|) . However, the initial convergence of A-HALS is in all cases 
at least as fast as that of HALS. 



6 Conclusion 

In this paper, we considered the mult iplicative upda. t es of 



alternating least squares algorithm of ICichocki et al.l (J2007l ). We introduced accelerated variants of 



Lee &: Seund ( 200ll ) and the hierarchical 



these two schemes, based on a careful analysis of the computational cost spent at each iteration, 
and preserve the convergence properties of the original algorithms. The idea behind our approach 
is based on taking better advantage of the most expensive part of the algorithms, by repeating a 
(safeguarded) fixed number of times the cheaper part of the iterations. This technique can in principle 
be applied to most NMF algo rithms; in p articular, we showed how it can substantially improve the 
projected gradient method of iLinI ( 2007al ). We then experimentally showed that these accelerated 
variants, despite the relative simplicity of the modification, significantly outperform the original ones, 
especially on de nse matrices, and com pete favorably with a state-of-the-art algorithm, namely the 
ANLS method of lJ.Kim &: ParkI ( 20081 ). A direction for future research would be to choose the number 
of inner iterations in a more sophisticated way, with the hope of further improving the efficiency of 
A-MU, A-PG and A-HALS. Finally, we observed that HALS and its accelerated version are the most 
efficient variants for solving NMF problems, sometimes by far. 
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